inference acceleration
Search for Efficient Large Language Models
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.Numerous efficient techniques, including weight pruning, quantization, and distillation, have been embraced to compress LLMs, targeting memory reduction and inference acceleration, which underscore the redundancy in LLMs.However, most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.Besides, traditional architecture search methods, limited by the elevated complexity with extensive parameters, struggle to demonstrate their effectiveness on LLMs.In this paper, we propose a training-free architecture search framework to identify optimal subnets that preserve the fundamental strengths of the original LLMs while achieving inference acceleration.Furthermore, after generating subnets that inherit specific weights from the original LLMs, we introduce a reformation algorithm that utilizes the omitted weights to rectify the inherited weights with a small amount of calibration data.Compared with SOTA training-free structured pruning works that can generate smaller networks, our method demonstrates superior performance across standard benchmarks.Furthermore, our generated subnets can directly reduce the usage of GPU memory and achieve inference acceleration.
Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles
Wei, Qingyan, Zhang, Yaojie, Liu, Zhiyuan, Liu, Dongrui, Zhang, Linfeng
Diffusion-based language models (dLLMs) have emerged as a promising alternative to traditional autoregressive LLMs by enabling parallel token generation and significantly reducing inference latency. However, existing sampling strategies for dLLMs, such as confidence-based or semi-autoregressive decoding, often suffer from static behavior, leading to suboptimal efficiency and limited flexibility. In this paper, we propose SlowFast Sampling, a novel dynamic sampling strategy that adaptively alternates between exploratory and accelerated decoding stages. Our method is guided by three golden principles: certainty principle, convergence principle, and positional principle, which govern when and where tokens can be confidently and efficiently decoded. We further integrate our strategy with dLLM-Cache to reduce redundant computation. Extensive experiments across benchmarks and models show that SlowFast Sampling achieves up to 15.63$\times$ speedup on LLaDA with minimal accuracy drop, and up to 34.22$\times$ when combined with caching. Notably, our approach outperforms strong autoregressive baselines like LLaMA3 8B in throughput, demonstrating that well-designed sampling can unlock the full potential of dLLMs for fast and high-quality generation.
Inference Acceleration of Autoregressive Normalizing Flows by Selective Jacobi Decoding
Zhang, Jiaru, Lu, Juanwu, Wang, Ziran, Zhang, Ruqi
Normalizing flows are promising generative models with advantages such as theoretical rigor, analytical log-likelihood computation, and end-to-end training. However, the architectural constraints to ensure invertibility and tractable Jacobian computation limit their expressive power and practical usability. Recent advancements utilize autoregressive modeling, significantly enhancing expressive power and generation quality. However, such sequential modeling inherently restricts parallel computation during inference, leading to slow generation that impedes practical deployment. In this paper, we first identify that strict sequential dependency in inference is unnecessary to generate high-quality samples. We observe that patches in sequential modeling can also be approximated without strictly conditioning on all preceding patches. Moreover, the models tend to exhibit low dependency redundancy in the initial layer and higher redundancy in subsequent layers. Leveraging these observations, we propose a selective Jacobi decoding (SeJD) strategy that accelerates autoregressive inference through parallel iterative optimization. Theoretical analyses demonstrate the method's superlinear convergence rate and guarantee that the number of iterations required is no greater than the original sequential approach. Empirical evaluations across multiple datasets validate the generality and effectiveness of our acceleration technique. Experiments demonstrate substantial speed improvements up to 4.7 times faster inference while keeping the generation quality and fidelity.
Search for Efficient Large Language Models
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.Numerous efficient techniques, including weight pruning, quantization, and distillation, have been embraced to compress LLMs, targeting memory reduction and inference acceleration, which underscore the redundancy in LLMs.However, most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.Besides, traditional architecture search methods, limited by the elevated complexity with extensive parameters, struggle to demonstrate their effectiveness on LLMs.In this paper, we propose a training-free architecture search framework to identify optimal subnets that preserve the fundamental strengths of the original LLMs while achieving inference acceleration.Furthermore, after generating subnets that inherit specific weights from the original LLMs, we introduce a reformation algorithm that utilizes the omitted weights to rectify the inherited weights with a small amount of calibration data.Compared with SOTA training-free structured pruning works that can generate smaller networks, our method demonstrates superior performance across standard benchmarks.Furthermore, our generated subnets can directly reduce the usage of GPU memory and achieve inference acceleration.
A Theoretical Perspective for Speculative Decoding Algorithm
Yin, Ming, Chen, Minshuo, Huang, Kaixuan, Wang, Mengdi
Transformer-based autoregressive sampling has been the major bottleneck for slowing down large language model inferences. One effective way to accelerate inference is \emph{Speculative Decoding}, which employs a small model to sample a sequence of draft tokens and a large model to validate. Given its empirical effectiveness, the theoretical understanding of Speculative Decoding is falling behind. This paper tackles this gap by conceptualizing the decoding problem via markov chain abstraction and studying the key properties, \emph{output quality and inference acceleration}, from a theoretical perspective. Our analysis covers the theoretical limits of speculative decoding, batch algorithms, and output quality-inference acceleration tradeoffs. Our results reveal the fundamental connections between different components of LLMs via total variation distances and show how they jointly affect the efficiency of decoding algorithms.
- South America > Brazil (0.04)
- Asia > Middle East > Qatar (0.04)
Parallel Speculative Decoding with Adaptive Draft Length
Liu, Tianyu, Li, Yun, Lv, Qitan, Liu, Kai, Zhu, Jianchen, Hu, Winston
Speculative decoding (SD), where an extra draft model is employed to provide multiple \textit{draft} tokens first and then the original target model verifies these tokens in parallel, has shown great power for LLM inference acceleration. However, existing SD methods suffer from the mutual waiting problem, i.e., the target model gets stuck when the draft model is \textit{guessing} tokens, and vice versa. This problem is directly incurred by the asynchronous execution of the draft model and the target model, and is exacerbated due to the fixed draft length in speculative decoding. To address these challenges, we propose a conceptually simple, flexible, and general framework to boost speculative decoding, namely \textbf{P}arallel sp\textbf{E}culative decoding with \textbf{A}daptive d\textbf{R}aft \textbf{L}ength (PEARL). Specifically, PEARL proposes \textit{pre-verify} to verify the first draft token in advance during the drafting phase, and \textit{post-verify} to generate more draft tokens during the verification phase. PEARL parallels the drafting phase and the verification phase via applying the two strategies, and achieves adaptive draft length for different scenarios, which effectively alleviates the mutual waiting problem. Moreover, we theoretically demonstrate that the mean accepted tokens of PEARL is more than existing \textit{draft-then-verify} works. Experiments on various text generation benchmarks demonstrate the effectiveness of our \name, leading to a superior speedup performance up to \textbf{3.79$\times$} and \textbf{1.52$\times$}, compared to auto-regressive decoding and vanilla speculative decoding, respectively.
- North America > United States (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
Inference acceleration for large language models using "stairs" assisted greedy generation
Grigaliūnas, Domas, Lukoševičius, Mantas
Large Language Models (LLMs) with billions of parameters are known for their impressive predicting capabilities but require lots of resources to run. With their massive rise in popularity, even a small reduction in required resources could have an impact on environment. On the other hand, smaller models require fewer resources but may sacrifice accuracy. In this work, we are proposing an implementation of ``stairs'' assisted greedy generation. It is a modified assisted generation methodology that makes use of a smaller model's fast generation, large model's batch prediction, and "stairs" validation in order to achieve a speed up in prediction generation. Results show between 9.58 and 17.24 percent inference time reduction compared to a stand-alone large LLM prediction in a text generation task without a loss in accuracy.
- Europe > Lithuania > Kaunas County > Kaunas (0.05)
- North America > United States > California > San Diego County > San Diego (0.04)
Inference Acceleration for Large Language Models on CPUs
PS, Ditto, VG, Jithin, MS, Adarsh
In recent years, large language models have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, deploying these models for real-world applications often requires efficient inference solutions to handle the computational demands. In this paper, we explore the utilization of CPUs for accelerating the inference of large language models. Specifically, we introduce a parallelized approach to enhance throughput by 1) Exploiting the parallel processing capabilities of modern CPU architectures, 2) Batching the inference request. Our evaluation shows the accelerated inference engine gives an 18-22x improvement in the generated token per sec. The improvement is more with longer sequence and larger models. In addition to this, we can also run multiple workers in the same machine with NUMA node isolation to further improvement in tokens/s. Table 2, we have received 4x additional improvement with 4 workers. This would also make Gen-AI based products and companies environment friendly, our estimates shows that CPU usage for Inference could reduce the power consumption of LLMs by 48.9% while providing production ready throughput and latency.
Distributed Deep Learning Inference Acceleration using Seamless Collaboration in Edge Computing
Li, Nan, Iosifidis, Alexandros, Zhang, Qi
This paper studies inference acceleration using distributed convolutional neural networks (CNNs) in collaborative edge computing. To ensure inference accuracy in inference task partitioning, we consider the receptive-field when performing segment-based partitioning. To maximize the parallelization between the communication and computing processes, thereby minimizing the total inference time of an inference task, we design a novel task collaboration scheme in which the overlapping zone of the sub-tasks on secondary edge servers (ESs) is executed on the host ES, named as HALP. We further extend HALP to the scenario of multiple tasks. Experimental results show that HALP can accelerate CNN inference in VGG-16 by 1.7-2.0x for a single task and 1.7-1.8x for 4 tasks per batch on GTX 1080TI and JETSON AGX Xavier, which outperforms the state-of-the-art work MoDNN. Moreover, we evaluate the service reliability under time-variant channel, which shows that HALP is an effective solution to ensure high service reliability with strict service deadline.
Xilinx Selects Mipsology Zebra Software to Accelerate Alveo U50 FPGA – IAM Network
AI software innovator Mipsology today announced that its Zebra neural network accelerating software has been integrated into the latest build of Xilinx's Alveo U50 data center accelerator card, the industry's first low profile adaptable accelerator with PCIe Gen 4 support. Zebra's ease-of-use and high throughput enable the Alveo U50 to compute convolutional neural networks with zero effort. This is the latest in a series of Zebra-enhanced Xilinx boards that enable inference acceleration for a wide variety of sophisticated AI applications. "The level of acceleration that Zebra brings to our Alveo cards puts CPU and GPU accelerators to shame," said Ramine Roane, Xilinx's Vice President of marketing. "Combined with Zebra, Alveo U50 meets the flexibility and performance needs of AI workloads and offers high throughput and low latency performance advantages to any deployment."